-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29890][SQL][2.4] DataFrameNaFunctions.fill should handle duplicate columns #27407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you so much, @imback82 . |
|
Also, cc @gatorsmile |
|
Do you think we need to back port #26700, which fixes a very similar issue? |
|
I switched the type of SPARK-30065 to |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. (Pending Jenkins.)
I tested the changed part and the new test cases locally.
Thank you, @imback82 !
|
Test build #117587 has finished for PR 27407 at commit
|
|
Merged to branch-2.4. |
…cate columns (Backport of #26593) ### What changes were proposed in this pull request? `DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.fill("hello").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) at org.apache.spark.sql.Dataset.col(Dataset.scala:1268) ``` The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now the above example displays the following: ``` +----+-----+-----+ |col1| col2| col2| +----+-----+-----+ | 1|hello| 2| | 3| 4|hello| +----+-----+-----+ ``` ### How was this patch tested? Added new unit tests. Closes #27407 from imback82/backport-SPARK-29890. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(Backport of #26593)
What changes were proposed in this pull request?
DataFrameNaFunctions.filldoesn't handle duplicate columns even when column names are not specified.produces
The reason for the above failure is that columns are looked up with
DataSet.col()which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity.This PR updates
DataFrameNaFunctions.fillsuch that if the columns to fill are not specified, it will resolve ambiguity gracefully by applyingfillto all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).Why are the changes needed?
If column names are not specified,
fillshould not fail due to ambiguity since it should still be able to applyfillto the eligible columns.Does this PR introduce any user-facing change?
Yes, now the above example displays the following:
How was this patch tested?
Added new unit tests.